ABSTRACT

The growing popularity of online social networks has given
researchers access to large amounts of social network data. This,
coupled with ever-increasing computation speed, storage capacity,
and data mining capabilities, has renewed interest in automatic
community detection methods. Surprisingly, there is no universally
accepted definition of a community. One frequently used definition
describes communities as sets of nodes "that have more and/or
better-connected 'internal edges' connecting members of the set than
'cut edges' connecting the set to the rest of the world" [10]. This
definition inspired the modularity-maximization class of community
detection algorithms, which look for regions of the network that
have a higher than expected density of edges within them. We
introduce an alternative definition, which states that a community
is composed of individuals who have more influence on others within
the community than on those outside of it. We present a mathematical
formulation of influence, define an influence-based modularity
metric, and show how to use it to partition the network into
communities. We evaluated our approach on the standard data sets
used in the literature, and found that it often outperforms the
edge-based modularity algorithm.

Keywords 

community structure, automatic detection, social networks,
global influence, modularity, eigenvectors

1. INTRODUCTION 

Communities and social networks have been a source of
interest for researchers for several decades [3][7]. However,
one of the main problems faced by early researchers was
the difficulty of collecting acquaintanceship and related
empirical data from human subjects [3]. The advent of the
internet and the growing popularity of online social networks
changed that, providing researchers with access to huge
amounts of invaluable human social network data. This, coupled
with ever-increasing computation speed, storage capacity,
and data mining capabilities, led to the reemergence
of interest in social networks in general, and community
detection methods specifically.

Despite a long history of investigation, surprisingly, there 
is not a single universally accepted definition of the commu- 
nity. A definition preferred by sociologists is that a commu- 
nity is composed of individuals who are similar to one an- 
other in some way, whether it is because they see the same 
friends or belong to the same organizations. This definition 
inspired the class of community-finding methods based on 
hierarchical clustering. These algorithms assign nodes to 
the same community if they are sufficiently similar to each 
other. Similarity measures include structural equivalence, 
where two nodes are said to be equivalent if they have the 
same set of neighbors, and approximate equivalence that 
uses Euclidean distance and Pearson correlation. Another 
similarity measure used in hierarchical clustering methods is 
the number of paths between nodes. Hierarchical clustering, 
however, may not assign every node to a non-trivial commu- 
nity. In addition, it does not provide a measure of how good 
a particular division of the network into communities is. 

Physicists and computer scientists prefer to define com- 
munity as "a group of vertices in which there are more 
edges between vertices within the group than to vertices 
outside of it" [2]. This definition helped inform a variety of 
graph-based approaches to automatic community detection, 
including graph partitioning and modularity optimization 
techniques. Graph partitioning algorithms [5][17] attempt
to minimize the number of edges running between commu- 
nities. One of the main disadvantages of these methods 
is that either the number of communities has to be speci- 
fied a priori, or they repeatedly bisect the graph without 
a well-defined stopping point. Since it is almost impossi- 
ble to always know beforehand the number of communities 
within a large network, these methods are unable to auto- 
matically detect natural communities. Furthermore there 
is no guarantee that the communities into which we have 
divided the network represent the best possible community 
division of the network. Newman and his colleagues realized
that rather than minimize the number of edges running
between groups, one should instead look for groups that
have a higher than expected number of edges within them
and a lower than expected number of edges between them
[12][11][13]. These algorithms maximize a measure called
modularity, which is the fraction of all edges within communities
minus the expected value of the same quantity. The modularity
optimization method is fast (if approximate), and can
be applied to both undirected and directed graphs. It is
able to find the "best" assignment of nodes to communities,
although each node can belong to only a single community.
Some researchers have recently questioned the applicabil- 
ity to real-world networks of the edge-density definition of 
the community and the modularity optimization techniques 
based on it. Leskovec et al. [10] found that in large networks,
communities tend to 'blend' into the giant connected com- 
ponent, making it impossible to extract any but the trivial 
small and tightly knit communities. 

We stake a claim in this active field by introducing an 
alternative definition of community that is based on in- 
formation spread on networks. We claim (without much 
theoretical or empirical support) that a community is com- 
posed of individuals who have more influence on individuals 
within the community than on those outside of it. We take 
a structure-based view of influence, defining it as the num- 
ber of paths, of any length, that exist between two nodes. 
The more paths there are, the more opportunities one node 
has to affect the other. This will result in the actions of 
the community members becoming correlated with time, 
whether through adopting a new fashion trend or vocabulary
terms, watching a movie, or buying a product. We
define an influence-based modularity metric, and show how to
use it to partition a network into communities. We evaluated 
our approach on the standard data sets used in literature, 
and found that it gives at least as good performance as the 
standard modularity-based algorithm. 

The paper is organized as follows. In Section 2 we define
and give a mathematical derivation of influence. Section 3
describes our redefinition of the modularity metric in terms
of influence, and shows how the new modularity can be used
for automatic community detection. We present results of
applying our approach to well-studied networks in Section 4.
In Section 5 we compare our approach to those that have
previously been described in the literature, and we conclude
with Section 6.

2. A MEASURE OF GLOBAL INFLUENCE 

A network of N nodes and E links can be represented 
using a graph G(N,E), where N is the number of vertices 
of the graph representing the nodes of the network, and E 
is the number of edges of the graph. Edges are directed; 
however, if there exists an edge from vertex i to j and
also from j to i, it is represented as an undirected edge. A 
path p is an n hop path from vertex i to j, if there are n 
vertices between the vertex i and vertex j along the path. 
We allow the paths to be non-self-avoiding, meaning that
the same pair of vertices could be traversed more than once
on the path. The graph G(N, E) can be represented by an
adjacency matrix A whose elements $A_{ij}$ are defined as

$$A_{ij} = \begin{cases} 1 & \text{if there exists an edge from vertex } i \text{ to } j, \\ 0 & \text{otherwise.} \end{cases}$$



We introduce an index for measuring the degree of in- 
fluence a node has on other nodes. We use this index to 
divide the network into communities so that nodes which 
have higher influence on each other are grouped together. 
At the same time this index could also be used to find out 
the status of the people in the community based on their 
influence [7]. 

Influence can be defined as the capacity to have an effect
on someone. Pool and Kochen [3] state that "influence in
large part is the ability to reach a crucial man through the
right channels, and the more the channels in reserve the
better." This is the measure of global influence that we
employ, and we also adopt the concept of attenuation when
influence is transmitted through intermediaries [7]. Therefore,
influence depends not only on direct contact between people,
but also on the number of ways an individual can reach another,
that is, the number of n-hop paths between them. Hence, the
influence of node a on b is likely to be greater if there are
more paths from a to b.

Figure 1: Connectivity: Edge Connectivity and Path Connectivity

The strength of the effect via longer paths with more
intermediaries is likely to be lower than via shorter chains with
fewer intermediaries. We model the attenuation of influence
over longer chains through two parameters, $\alpha$ and $\beta$.
We use two parameters, rather than a single parameter, to model
the fact that a node may have more influence over its direct
neighbors than it will have over the neighbors' neighbors,
and so on. Thus, $\beta$ ($0 \le \beta \le 1$) is the direct
attenuation factor, the probability that the effect will be
transmitted to the immediate neighbors of the node, and $\alpha$
($0 \le \alpha \le 1$) is the indirect attenuation factor, the
probability that the effect will be transmitted through links
other than those to the node's immediate neighbors (i.e., via
friends of friends). Let us consider transmitting an effect or a
message from node b to node c in the network in Figure 1. The
probability of transmission to the immediate neighbors of b is
$\beta$. The probability of transmission over each of the five
1-hop paths is $\beta\alpha$. In general, the probability of a
transmission along an n-hop path is $\beta\alpha^{n}$. Note that
$\beta = \alpha$ is a special case in which the transmission
probability along all links is the same.
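
As a small worked illustration of these attenuation weights, the following Python sketch evaluates the probability of transmission along a path with n intermediaries, using the weights $\beta\alpha^{n}$ that also appear in the series for total influence. The function name `path_weight` and the numerical values of $\alpha$ and $\beta$ are assumptions chosen only for illustration; they are not taken from the paper.

```python
def path_weight(beta, alpha, n):
    """Probability of transmission along a path with n intermediaries.

    n = 0 is a direct link (weight beta); each additional intermediary
    multiplies the weight by alpha.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    return beta * alpha ** n


# Illustrative values only (assumed, not from the paper).
beta, alpha = 0.5, 0.1
weights = [path_weight(beta, alpha, n) for n in range(4)]
```

With these assumed values each extra intermediary reduces the weight by a factor of ten, so long chains contribute little unless there are many of them.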

The total influence of node b on node c is thus depen- 
dent on the number of (attenuated) channels between b and 
c, or the sum of all the weighted paths from node b to c. 
This definition of influence makes intuitive sense, because 
the greater the number of paths between b and c, the more 
opportunities there are for b to transmit messages to c and 
to affect what c is doing. 

We represent the total number of links from node i to node
j as $A^{(0)}_{ij}$, which is given by the elements $A_{ij}$ of the
adjacency matrix A. Next, we represent the total number
of 1-hop paths from node i to node j as $A^{(1)}_{ij}$, and it
is given by $\sum_{k=1}^{N} A_{ik}A_{kj}$, since a path can exist
from i to j in one hop via a particular node k iff there exists
an edge from i to k and also from k to j. Summing over all
$k \in N$, we get the above result. We define the matrix
$A_1 = A \cdot A$ whose elements,
$A^{(1)}_{ij} = \sum_{k=1}^{N} A_{ik}A_{kj}$, give the total number
of 1-hop paths from i to j.
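
The path-counting property of the matrix product can be checked on a toy graph. The sketch below uses NumPy and a hypothetical four-node directed graph (an assumption made for illustration) in which node 0 reaches node 3 through two different intermediaries.

```python
import numpy as np

# Hypothetical directed graph: 0->1, 0->2, 1->3, 2->3.
A = np.array([
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
])

# A_1 = A . A: entry (i, j) counts the paths from i to j that pass
# through exactly one intermediary k, i.e. sum_k A[i, k] * A[k, j].
A1 = A @ A

# Node 0 reaches node 3 via node 1 and via node 2.
paths_0_to_3 = int(A1[0, 3])
```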

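Similarly, the total number of 2-hop paths from node i to
node j is represented as $A^{(2)}_{ij}$ and is given by
$\sum_{k=1}^{N}\sum_{m=1}^{N} A_{ik}A_{km}A_{mj}$.
We define the matrix $A_2 = A \cdot A \cdot A$ whose elements
$A^{(2)}_{ij} = \sum_{k=1}^{N} A^{(1)}_{ik} \cdot A_{kj}$ give the
total number of 2-hop paths from i to j.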

Generalizing, the total number of chains from node i to
node j with n intermediaries is represented as $A^{(n)}_{ij}$.
It is given by the elements of the matrix $A_n$, where

$$A_n = \underbrace{A \cdot A \cdots A}_{n+1 \text{ times}} = A_{n-1} \cdot A \qquad (1)$$

Adding weights to take into account the attenuation of the
effect of node i on node j, we get the total influence of node i
on j as

$$\beta A^{(0)}_{ij} + \beta\alpha A^{(1)}_{ij} + \cdots + \beta\alpha^{n} A^{(n)}_{ij} + \cdots$$

We represent the measure of influence of nodes on other
nodes by the influence matrix P, where

$$P = \beta A + \beta\alpha A_1 + \cdots + \beta\alpha^{n} A_n + \cdots \qquad (2)$$

After elementary manipulations, this series can be rewritten
as

$$P = \beta A (I - \alpha A)^{-1} \qquad (3)$$

where I is the identity matrix. This equation holds while
$\alpha$ is less than the reciprocal of the largest characteristic
root of the adjacency matrix A [4].
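
A minimal NumPy sketch of the influence matrix follows, assuming a dense adjacency matrix that is small enough to invert directly. It computes the closed form $\beta A (I - \alpha A)^{-1}$ and, as a sanity check, a truncated version of the series of weighted path counts. The function names, the triangle test graph, and the values of $\alpha$ and $\beta$ are assumptions made for illustration.

```python
import numpy as np


def influence_matrix(A, alpha, beta):
    """Closed form P = beta * A * (I - alpha * A)^(-1)."""
    A = np.asarray(A, dtype=float)
    # The series converges only for alpha < 1 / (largest eigenvalue of A).
    lam_max = np.max(np.abs(np.linalg.eigvals(A)))
    if lam_max > 0 and alpha >= 1.0 / lam_max:
        raise ValueError("alpha must be below 1 / (largest eigenvalue of A)")
    I = np.eye(A.shape[0])
    return beta * A @ np.linalg.inv(I - alpha * A)


def influence_series(A, alpha, beta, terms=50):
    """Truncated series: sum over n of beta * alpha^n * A^(n+1)."""
    A = np.asarray(A, dtype=float)
    total = np.zeros_like(A)
    power = A.copy()  # holds A^(n+1), starting at n = 0
    for n in range(terms):
        total += beta * alpha ** n * power
        power = power @ A
    return total


# Hypothetical undirected triangle; its largest eigenvalue is 2,
# so alpha must stay below 0.5.
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
P_closed = influence_matrix(A, alpha=0.2, beta=1.0)
P_series = influence_series(A, alpha=0.2, beta=1.0)
```

For large networks, solving the linear system with `np.linalg.solve(I - alpha * A, A)` instead of forming the inverse would likely be preferable; the explicit inverse is kept here only because it mirrors the formula.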

The influence matrix captures the effective connectedness 
of a node not only in terms of the number of nodes it is
directly connected to, but also in terms of the number of nodes
it is indirectly connected to. This formulation is mathematically
similar to the weights between vertices used in the
hierarchical clustering algorithm of Girvan and Newman [6],
where the weights depended on the total number of paths 
between nodes. Rather than using influence to measure sim- 
ilarity between nodes, as done in that work, we will use it 
to find groups of nodes that exert higher than expected in- 
fluence on each other. 

3. COMMUNITIES AND INFLUENCE 

The objective of the algorithms proposed by Newman and 
coauthors was to discover "community structure in networks 
— natural divisions of network nodes into densely connected 
subgroups" [15]. They proposed modularity as a measure for
evaluating the strength of the discovered community struc- 
ture. Algorithmically, their approach to discovering network 
structure is based on finding groups with a higher than
expected number of edges within them and a lower than expected
number of edges between them [12][11][14][13]. The modularity Q,
which is optimized by the algorithm, is given by:

Q = (fraction of edges within communities) - (expected fraction of such edges).

Thus, they use Q as a numerical index to evaluate a partic- 
ular division of the network. The underlying idea, therefore, 
is that connectivity of nodes belonging to the same commu- 
nity is greater than that of nodes belonging to different com- 
munities, and they take the number of edges as the measure 
of connectivity. But is edge connectivity the true measure 
of connectivity on the network? 

Consider again the graph in Figure 1, where there exists
an edge between a and c but not between b and c. However,
clearly c is not unconnected from b, as there exist several
distinct channels for b to send information to, or influence, c.
The influence matrix that we defined above gives a
mathematical model of the global connectivity of the network.
We will use this connectivity to identify communities in the 
network. 

3.1 Influence-based Modularity 

We redefine modularity Q as

Q = (connectivity within the community) - (expected connectivity within the community)

and adopt the influence matrix P as the measure of
connectivity. This definition of modularity implies that in the
best division of the network, the influence of nodes within
their community is greater than their influence outside their
community. The best division of the network into communities,
therefore, maximizes the difference between the actual
influence and the expected influence within the community,
where the expected influence is given by the influence in an
equivalent random graph. Let us denote the expected influence
by an $N \times N$ matrix $\bar{P}$. Modularity Q then can be
expressed as

$$Q = \sum_{ij} \left[ P_{ij} - \bar{P}_{ij} \right] \delta(s_i, s_j) \qquad (4)$$

where $s_i$ is the index of the community i belongs to and

$$\delta(s_i, s_j) = \begin{cases} 1 & \text{if } s_i = s_j, \\ 0 & \text{otherwise.} \end{cases}$$

When all the vertices are placed in a single group, then it is
axiomatically assumed that $Q = 0$. Thus we have
$\sum_{ij} [P_{ij} - \bar{P}_{ij}] = 0$. Hence, the total influence W is

$$W = \sum_{ij} \bar{P}_{ij} = \sum_{ij} P_{ij} \qquad (5)$$

Hence the null model against which we compare our network
has the same number of vertices N as the original model,
and in it the expected influence of the entire network equals
the actual influence of the original network.

We further restrict the choice of null model to that where
the expected influence $W_j^{in}$ on a given vertex j from all other
vertices is equal to the actual influence on the corresponding
vertex in the real network:

$$W_j^{in} = \sum_{i} \bar{P}_{ij} = \sum_{i} P_{ij} \qquad (6)$$

Similarly, we also assume that in the null model, the
expected influence $W_i^{out}$ of a given vertex i on all other
vertices is equal to the actual influence of the corresponding
vertex in the real network:

$$W_i^{out} = \sum_{j} \bar{P}_{ij} = \sum_{j} P_{ij} \qquad (7)$$



The null model of this class that we then consider has paths
that are placed at random between vertices subject to the
constraints of Equation (6) and Equation (7). This implies
then that the expected influence $\bar{P}_{ij}$ of vertex i on
vertex j can be written as

$$\bar{P}_{ij} = f_1(W_i^{out}) \, f_2(W_j^{in}) \qquad (8)$$

where $f_1$ and $f_2$ are some functions. We rewrite Equation (7) as

$$\sum_{j} \bar{P}_{ij} = f_1(W_i^{out}) \sum_{j} f_2(W_j^{in}) = W_i^{out} \qquad (9)$$

for all i, and hence

$$f_1(W_i^{out}) = C_1 W_i^{out} \qquad (10)$$

for some constant $C_1$.

Along the same lines we have

$$\sum_{i} \bar{P}_{ij} = f_2(W_j^{in}) \sum_{i} f_1(W_i^{out}) = W_j^{in} \qquad (11)$$

for all j, and hence

$$f_2(W_j^{in}) = C_2 W_j^{in} \qquad (12)$$

for some constant $C_2$. Therefore, the expected influence is

$$\sum_{ij} \bar{P}_{ij} = \sum_{ij} C_1 C_2 W_i^{out} W_j^{in} = C_1 C_2 \sum_{i} W_i^{out} \sum_{j} W_j^{in} = C_1 C_2 W^2$$

Now using Equation (5) we have

$$W = \sum_{ij} P_{ij} = \sum_{ij} \bar{P}_{ij} = C_1 C_2 W^2, \qquad (13)$$

which we can solve for $C_1 C_2$. Using Equations (8)-(12) we can
write the expected influence as

$$\bar{P}_{ij} = \frac{W_i^{out} W_j^{in}}{W} \qquad (14)$$

and the influence-based modularity as

$$Q = \sum_{ij} \left[ P_{ij} - \frac{W_i^{out} W_j^{in}}{W} \right] \delta(s_i, s_j) \qquad (15)$$
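
The influence-based modularity derived above can be evaluated directly for a given assignment of nodes to communities. The following NumPy sketch is one possible reading of that formula: it takes a matrix P and one community label per node, builds the expected influence from the row and column sums of P, and sums the difference over pairs of nodes in the same community. The function name and the two-triangle test graph are assumptions made for illustration, and no normalization beyond what the formula states is applied.

```python
import numpy as np


def influence_modularity(P, communities):
    """Q = sum_ij [P_ij - W_i^out * W_j^in / W] * delta(s_i, s_j)."""
    P = np.asarray(P, dtype=float)
    s = np.asarray(communities)
    W = P.sum()             # total influence
    w_out = P.sum(axis=1)   # W_i^out = sum_j P_ij
    w_in = P.sum(axis=0)    # W_j^in  = sum_i P_ij
    expected = np.outer(w_out, w_in) / W
    same = s[:, None] == s[None, :]  # delta(s_i, s_j) as a boolean mask
    return float(((P - expected) * same).sum())


# Hypothetical example: two disconnected triangles, with P taken to be
# the adjacency matrix (the special case alpha = 0, beta = 1).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0

q_split = influence_modularity(A, [0, 0, 0, 1, 1, 1])
q_single = influence_modularity(A, [0, 0, 0, 0, 0, 0])
```

Placing all nodes in one community gives Q = 0, as the derivation requires, while separating the two triangles gives a positive value.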



3.2 Detecting Community Structure 

Once we have derived Q, we have to select an algorithm
to divide the network into communities that optimize Q.
Like others [15][14][9], we use the matrix-based approach
analogous to spectral partitioning. Possible approaches
that could then be used for community detection include the
leading eigenvector method, the vector partitioning method, and
so on. We implemented the leading eigenvector method [13].
We summarize the approach in the Appendix.
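
To make the idea of a leading-eigenvector bisection concrete, here is a rough NumPy sketch of a single bisection step. It is not the authors' implementation (the Appendix is not reproduced here); it assumes the modularity matrix $B_{ij} = P_{ij} - W_i^{out} W_j^{in} / W$, symmetrizes it as $B + B^T$ so that a symmetric eigensolver can be used, and assigns nodes by the sign of the eigenvector with the largest eigenvalue. The symmetrization step, the function name, and the test graph are all assumptions.

```python
import numpy as np


def bisect_by_leading_eigenvector(P):
    """Split nodes into two groups by the sign of the leading eigenvector."""
    P = np.asarray(P, dtype=float)
    W = P.sum()
    # Modularity matrix: actual minus expected influence.
    B = P - np.outer(P.sum(axis=1), P.sum(axis=0)) / W
    # Assumed symmetrization, so that eigh (symmetric solver) applies.
    B_sym = B + B.T
    vals, vecs = np.linalg.eigh(B_sym)
    leading = vecs[:, np.argmax(vals)]
    return np.where(leading >= 0, 1, -1)


# Hypothetical test graph: two triangles joined by the single edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

# Passing the adjacency matrix corresponds to alpha = 0, beta = 1.
labels = bisect_by_leading_eigenvector(A)
```

The overall sign of an eigenvector is arbitrary, so only the grouping is meaningful, not which group receives the label 1. Finding more than two communities would require repeating the bisection with a suitable stopping rule, which this sketch does not attempt.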

4. EVALUATION ON REAL NETWORKS 

We evaluated our approach by using it to find communi- 
ties on real networks that can be found in literature. 



4.1 Zachary's Karate Club 

We applied this method to the friendship network of Zachary's
karate club [19]. Zachary studied the friendship network of a
karate club for two years. During the course of the study, a
disagreement developed between the administrator of the club
and the club's instructor, resulting in the division of the club
into two factions, represented by circles and squares in
Figure 2. The natural communities existing in the club have been
predicted by various community detection and graph partitioning
algorithms. We used this network to compare the performance of
the algorithm proposed in this paper to Newman's
community-finding algorithms.

Figure 2 presents results of different community-finding
approaches. Figure 2(a) shows results of the modularity
maximization-based approach proposed by Newman [13] when
the network is bisected into two communities only.
Figure 2(b) shows results of a similar bisection done by our
algorithm with $\beta = 1/N$ and $\alpha = 1/N$, where N = 34 is
the number of nodes. Both methods result in the correct
assignment of individuals to communities and give better results
than the spectral bisection algorithm and hierarchical
clustering, which does not assign all nodes to the
principal communities [13]. However, finding natural
communities in the karate club network by iterating each
algorithm until a stopping condition is reached leads to
different results. Newman's method divides the network into four
communities (Figure 2(c)), while our method divides it into
three communities (Figure 2(d)). Two of the communities
generated by Newman's algorithm (shown in pink and red in
Figure 2(c)) are similar to two of the three communities
found by our algorithm. However, Newman's algorithm further
subdivides the circle nodes, putting node 1 into the same
community as five of its immediate contacts, but into a different
community than nine of its immediate contacts. Our algorithm
appears to give a more realistic division of the karate club
network into natural communities.

4.2 College Football 

We also ran our approach on the US College football data
from Girvan et al. [6].¹ The network represents the schedule
of Division 1 games for the 2000 season, where the vertices
represent teams (colleges) and the edges represent the
regular season games between the two teams they connect.
The teams are divided into "conferences" containing 8 to
12 teams each. Games are more frequent between members
of the same conference than between members of different
conferences, leading to a community structure with greater
connectivity within the communities (represented by conferences)
than between them. Inter-conference games, however, are
not uniformly distributed: teams that are geographically
closer are likely to play more games with one another than
teams separated by large geographic distances. However, as
the authors state [6], there are some conferences, like Sunbelt,
whose teams played nearly as many games against teams
in other conferences (Western Athletic in the case of Sunbelt)
as they did against teams within their own conference. This
leads to the intuition that the conferences may not be
the natural communities present in the data: the natural
communities may actually be bigger than the conferences, with
conferences that play as many games between them as within
them being grouped into the same community. How then can we
evaluate the purity of the natural communities detected?

¹ The college football data is available at
http://www-personal.umich.edu/~mejn/netdata/.

Figure 2: Results of applying different community-finding algorithms to Zachary's karate club network. The
numbered vertices represent the members of the club and edges represent friendships. The factions into which
the club split during the course of the study are shown by squares and circles. (a & b) Communities found
after running a single iteration (graph bisection) using (a) Newman's and (b) the proposed algorithm. (c
& d) Natural communities found by running (c) Newman's algorithm and (d) the proposed algorithm until
the termination condition is reached.

We define purity as the total pair-wise similarity between
teams that actually belong to the same conference. Thus,
the similarity between two teams in a predicted community
is 1 if they belong to the same actual conference, and it is
0 if the two teams belong to different conferences. The
maximum total similarity would then be obtained if all teams
belonging to the same conference end up in the same
community. The purity of a prediction is then evaluated as the
total similarity when teams are grouped in accordance with the
communities predicted by the algorithm, divided by the maximum
total similarity. We vary $\beta$ (keeping $\alpha$ constant) and
observe the change in purity of the predicted communities
(Figure 3). Figure 3(a) shows that for a given value of $\alpha$,
purity is constant irrespective of the value of $\beta$, and hence
purity depends primarily on the value of $\alpha$. We next vary
$\alpha$, keeping $\beta$ constant ($\beta = 1$), and compute the
corresponding change in purity. Figure 3(b) shows that community
purity increases with the increase of $\alpha$, reaching almost
90% near $\alpha = 0.1$ (the upper bound on $\alpha$ is determined
by the reciprocal of the largest eigenvalue of the adjacency
matrix). This shows that as we increase the attenuated effect of
links not directly connected to the nodes, the groups become
purer, independently of the attenuated effect of the direct
links. When $\alpha = 0$ and $\beta = 1$, we get influence dependent
only on direct contacts. Hence modularity in this case reduces
to the one studied by Newman [13], and gives around 72%
purity on the football data. The number of groups predicted
changes from eight at $\alpha = 0$ to four when $\alpha$ nears 0.1.
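
The purity measure defined at the start of this discussion can be written as a count over pairs of teams. The sketch below is one plausible reading of that definition in plain Python: the denominator counts pairs of teams in the same actual conference (the maximum total similarity), and the numerator counts those pairs that the prediction also places in the same community. The function name and the toy labels are assumptions for illustration.

```python
from itertools import combinations


def purity(predicted, actual):
    """Fraction of same-conference pairs kept together by the prediction.

    predicted, actual: sequences of labels, one entry per team.
    """
    same_both = 0    # same conference and same predicted community
    same_actual = 0  # same conference (maximum total similarity)
    for i, j in combinations(range(len(actual)), 2):
        if actual[i] == actual[j]:
            same_actual += 1
            if predicted[i] == predicted[j]:
                same_both += 1
    return same_both / same_actual if same_actual else 0.0


# Hypothetical toy data: four teams in two conferences.
actual = [0, 0, 1, 1]
merged = purity([0, 0, 0, 0], actual)  # both conferences in one community
split = purity([0, 1, 2, 2], actual)   # first conference broken apart
```

Under this reading, merging whole conferences into a larger community does not lower purity, which is consistent with the argument that natural communities may be bigger than conferences; only splitting a conference across communities is penalized.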

4.3 Political Books 

Next we evaluated the approach on the political books
data compiled by V. Krebs.² In this network the nodes
represent books about US politics sold by the online bookseller
Amazon. Edges represent frequent co-purchasing of books
by the same buyers, as indicated by the "customers who
bought this book also bought these other books" feature
on Amazon. The nodes were given labels liberal, neutral,
or conservative by Mark Newman on a reading of the
descriptions and reviews of the books posted on Amazon.³ Of
the books, 49 were marked as conservative, 43 as liberal, and
13 as neutral. We use our algorithm to find the existing
community structure in the network by varying the parameter
$\alpha$, as shown in Figure 4. We see that as the value of
$\alpha$ increases, the number of communities formed decreases
(changing from four at $\alpha = 0$ to two at $\alpha = 0.08$,
keeping $\beta$ constant). Again, the reciprocal of the largest
eigenvalue is taken as the upper bound for the value of $\alpha$.
The purity of the communities detected also increases, from 60%
at $\alpha = 0$ to as high as 92% at $\alpha = 0.08$. Again, note
that at $\alpha = 0$ the method reduces to Newman's modularity
maximization method. Another interesting observation is that
when $\alpha$ was taken as 0.08, leading to the formation of two
groups, six of the neutral books were in one group, which
otherwise consisted entirely of conservative books (52 books, of
which 46 were labeled as conservative and six as neutral), and
seven were in the other group (53 books, of which 43 were labeled
liberal, seven neutral, and three conservative). This
indicates the possibility that of the 13 books labeled as
neutral, six were conservatively inclined and seven were
liberally inclined.

Figure 3: Purity of the predicted communities for different
values of $\alpha$ and $\beta$. (a) $\beta$ is varied while keeping
$\alpha$ constant. We see that purity depends primarily on the
value of $\alpha$. (b) $\alpha$ is varied and $\beta = 1$. We
see that as $\alpha$ increases the purity increases, reaching
almost 90% near $\alpha = 0.1$. This shows that as
we increase the attenuated effect of links that are
not directly connected to the nodes, the groups become
purer. When $\alpha = 0$, the method reduces to the
eigenvector-based modularity maximization method
postulated by Newman [13].

Figure 4: Purity of the predicted communities as $\alpha$ is
varied ($\beta$ is kept constant at 1) from $\alpha = 0$ to
$\alpha = 0.08$ (the reciprocal of the largest eigenvalue being
taken as the upper bound for the value of $\alpha$).

² http://www.orgnet.com/

³ This data is available at
http://www-personal.umich.edu/~mejn/netdata/

5. RELATED RESEARCH

Our work is a generalization of the eigenvector-based
modularity maximization method proposed by Newman [13].
Taking $\beta = 1$ and $\alpha = 0$ reduces the influence matrix to
the adjacency matrix, and the modularity that our algorithm
maximizes effectively reduces to the modularity defined by
Newman [13].

The Random Walk models [18] and the PageRank algorithm [16]
have been some of the more popular ways of analyzing the
relevance of nodes in a network, and may be used
for community finding. One way to look at Random Walk
models in graph G(N, E) is to start from a vertex u and take
random steps along the edges of the graph. The probability
of movement from vertex u to v is given by

$$T(u,v) = \begin{cases} 1/d_u & \text{if there exists an edge from vertex } u \text{ to } v \text{ in } G, \\ 0 & \text{otherwise,} \end{cases}$$

where $d_u$ is the degree of vertex u. This defines a walk
using the transition probability matrix T. The second way to
look at random walks is to look at the probability distribution
$\pi_t$ of vertices reached after t steps of traversing the graph
G. This can be viewed as the probability of being at a vertex
$v \in N$ after time t. Let us assume we start from vertex $v_0$;
hence the initial probability distribution of the vertex we are
at is

$$\pi_0(v) = \begin{cases} 1 & \text{if } v = v_0, \\ 0 & \text{otherwise.} \end{cases}$$

The probability distribution of the vertex that we are at
after time t is given by $\pi_t$, and hence

$$\pi_t(v) = \sum_{u} \pi_{t-1}(u) \, T(u,v) \qquad (16)$$

This can be represented using $\pi_t = \pi_{t-1} T$; therefore,

$$\pi_t = \pi_0 T^t. \qquad (17)$$
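
The evolution of the walk distribution can be illustrated with a few lines of NumPy. The sketch below builds the transition matrix by dividing each row of the adjacency matrix by the vertex's out-degree and then applies the update $\pi_t = \pi_{t-1} T$ repeatedly. Leaving the rows of vertices without outgoing edges at zero is an assumption of this sketch, as are the function names and the three-node path graph.

```python
import numpy as np


def transition_matrix(A):
    """T(u, v) = A[u, v] / d_u; rows of vertices with d_u = 0 stay zero."""
    A = np.asarray(A, dtype=float)
    out_deg = A.sum(axis=1)
    T = np.zeros_like(A)
    nonzero = out_deg > 0
    T[nonzero] = A[nonzero] / out_deg[nonzero, None]
    return T


def walk_distribution(A, start, t):
    """Distribution pi_t of a walk started at vertex `start`."""
    T = transition_matrix(A)
    pi = np.zeros(T.shape[0])
    pi[start] = 1.0           # pi_0 is concentrated on the start vertex
    for _ in range(t):
        pi = pi @ T           # pi_t = pi_{t-1} T
    return pi


# Hypothetical undirected path graph 0 - 1 - 2.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
pi_1 = walk_distribution(A, start=0, t=1)
pi_2 = walk_distribution(A, start=0, t=2)
```

On this bipartite example the distribution keeps alternating between the middle vertex and the two end vertices, so the sequence does not converge. This is a small instance of the convergence concerns that apply to such walks.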

However, for this tool to be useful several factors have to
be taken into consideration, including the convergence of
the sequence, the stationarity and stability of the
distribution, its uniqueness, and so on. If there exists a unique,
stable, stationary distribution $\pi$, then this would lead us to

$$\pi = \pi T \qquad (18)$$

Computing the eigenvector of the matrix T with eigenvalue
1 gives us the value of $\pi$, which is how the naive PageRank
algorithm evaluates the relevance of the nodes of the
network. Along with the property of the existence of a unique,
stationary, stable distribution, Random Walk with Restart
considers an additional probability of returning to the
initial state. If we take $\beta$ as the probability to move at
random, and $1 - \beta$ as the probability of jumping back to the
initial state, the Random Walk with Restart can be formulated as:

$$\pi_i = \beta T \pi_i + (1 - \beta) e_i \qquad (19)$$

where $e_i = [e_{ij}]$ and

$$e_{ij} = \begin{cases} 1 & \text{if } j = i, \\ 0 & \text{otherwise.} \end{cases}$$

Hence the vector $\pi_i = [\pi_{ij}]$ gives the relevance score of
all nodes $j \in N$ with respect to node i. Similarly, along with
the property of the distribution being stationary, PageRank with
restarts considers, at each time step t, a probability $\beta$ of
moving at random and a probability $1 - \beta$ of jumping to some
state chosen uniformly at random. Hence the transition matrix in
this case is modified to $T'$, where each element $T'_{ij}$ is given
by

$$T'_{ij} = \beta T_{ij} + (1 - \beta)/N \qquad (20)$$

and then, as in Equation (18), the PageRank vector $\pi$ is given
by the principal eigenvector of this matrix:

$$\pi = \pi T' \qquad (21)$$
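
A short NumPy sketch of the PageRank variant with uniform restarts follows. It forms the modified transition matrix with entries $\beta T_{ij} + (1 - \beta)/N$ and finds its stationary distribution by power iteration. It assumes that every row of T sums to 1; the default $\beta = 0.85$, the tolerance, the function name, and the example matrix are assumptions of this sketch rather than values from the paper.

```python
import numpy as np


def pagerank(T, beta=0.85, tol=1e-12, max_iter=1000):
    """Stationary distribution of T' = beta * T + (1 - beta) / N."""
    T = np.asarray(T, dtype=float)
    n = T.shape[0]
    T_prime = beta * T + (1.0 - beta) / n   # uniform restart term
    pi = np.full(n, 1.0 / n)                # start from the uniform vector
    for _ in range(max_iter):
        nxt = pi @ T_prime                  # one step: pi <- pi T'
        if np.abs(nxt - pi).sum() < tol:
            return nxt
        pi = nxt
    return pi


# Hypothetical row-stochastic matrix for the path graph 0 - 1 - 2.
T = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
pi = pagerank(T)
```

Without the restart term the walk on this graph is periodic and the plain iteration would oscillate. Mixing in the uniform jump makes the chain aperiodic and irreducible, so the iteration settles on a unique vector in which the middle node scores highest.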

In effect existence of a unique, stable stationary distri- 
bution is the fundamental concept behind most variations 
of random walk models and page rank algorithms 



18 16 



in the determination of the unique stable stationary distri- 
bution. We have T = D~ x A where D is the diagonal matrix 
of outdegrees. When G is undirected, the adjacency ma- 
trix A is symmetric, so the corresponding Laplacian is also 
symmetric, guaranteeing favorable properties of the spec- 
trum like the orthonormal basis of real eigenvectors. We 
can symmetrize T by considering a spectrum of T + T T or 
T ■ T T (where T T is the transpose of matrix T), but the 
problem then lies in the graphical interpretation eigenvalues 
without which these approaches are not really useful. In real 
life, and in social networks, there do exist directed graphs 
and as illustrated above it is difficult to apply the random 
walk and page rank models on them. 

If we think of vertices of the random walk graph as states 
of a Markov chain, then the property that governs the lim- 
iting behavior of 7ToT 4 is ergocity and we say that the cor- 
responding Markov chain is ergodic if there exists a unique 
stationary distribution it to which -koT* converges. The nec- 
essary and sufficient conditions of ergodicity of a Markov 
chain are irreducibility and aperiodicity. The Random Walk 
models can be used as a measure of mutual relevance and 
PageRank for relevance scores of individuals. However when 
we consider graphs in real life, especially social networks, 
these conditions are not necessarily satisfied (e.g., isolated 
communities) . 

These algorithms are basically concerned with the flow of 
information on a network. So, if we start from a node with, 
say a unit of information, which it spreads via the channels 
it has (outgoing links), the Random Walk model describes 
the spread of this information in the network when the in- 
formation flow attains equilibrium, and further exchange of 
information among the nodes does not change the distribu- 
tion of information. When we are thinking of the division of 
nodes into communities, we are not interested in the amount 
of information they finally have from each other, but in how 
this information reaches them, i.e., the channels of the flow 
of information. The more the channels for information flow 
a node has, the greater the tendency for the information it 
sends to reach its recipients. In other words, Random Walk 
models and PageRank algorithms are concerned with the 
equilibrium distribution of the flow of information, and we, 
on the other hand, are interested in the channels of infor- 
mation flow and their capacity to spread the information. 

Mathematically, the difference between the two approaches
can be stated as follows. Equation (17) gives us $\pi_t = \pi_0 T^t$.
Let $\pi_0(v_i)$ be the vector representing the initial probability
distribution over the vertices when we start the random walk
from vertex i. In this case we know that we are at i, and hence
$\pi_0(v_i)$ is given by the unit vector $e_i$ (defined above). Hence,
$[\pi_0(v_1), \pi_0(v_2), \ldots, \pi_0(v_N)] = [e_1, e_2, \ldots, e_N] = I$,
where I is the identity matrix. Hence, if we take
$P_t = [\pi_t(v_1), \pi_t(v_2), \ldots, \pi_t(v_N)]$, where $\pi_t(v_i)$ is
the vector representing the probability distribution of reaching
the vertices in t steps when we start the random walk from
vertex i, we have



$$
\begin{aligned}
P_t &= I\,T^t && (22) \\
    &= (D^{-1}A)^t && (23)
\end{aligned}
$$

Though widely used, especially in the determination of relevance
scores, they do have certain limitations. The non-symmetric
nature of a directed graph can lead to problems



The relevance matrix $P'$ given by the basic Random Walk
model is then Equation (22) at the time $t_n$ such that

$$P' = P_{t_n} = P_{t_n - 1} D^{-1} A \qquad (24)$$
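
As a rough illustration of the random walk relevance matrix, the sketch below computes $P_t = (D^{-1}A)^t$ for a hypothetical four-node undirected graph (a triangle with one pendant node) and checks the equilibrium condition numerically. The graph, the helper name `relevance`, and the choice of 200 steps are assumptions made for the example only.

```python
import numpy as np

# Hypothetical graph: triangle 0-1-2 with a pendant node 3 attached to 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D_inv = np.diag(1.0 / A.sum(axis=1))
T = D_inv @ A  # transition matrix D^{-1} A

def relevance(t):
    """P_t = I T^t: row i is the distribution after t steps from vertex i."""
    return np.eye(len(A)) @ np.linalg.matrix_power(T, t)

P_prev, P_eq = relevance(199), relevance(200)

# At equilibrium one more step no longer changes the matrix.
at_equilibrium = np.allclose(P_eq, P_prev @ T) and np.allclose(P_eq, P_prev)

# Every row has converged to the same stationary distribution,
# so the start vertex no longer matters.
rows_identical = np.allclose(P_eq, P_eq[0])
```

On this connected, non-bipartite example every row of the equilibrium matrix is the same distribution, so it records how much information each node finally holds and says nothing about the channels along which it arrived.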



On the other hand, the influence matrix P, as we have shown
above, is given by $P = \beta A (I - \alpha A)^{-1}$.

We can compute the influence score of the nodes relative to
the network using the influence matrix, as done by Katz [7].
Taking $p_{ij}$ as the influence scores of the nodes with respect
to each other, i.e., $p_{ij} = P_{ij}$, we have $p_i = \sum_j p_{ij}$. Hence,
the column vector p whose elements are $p_i$ gives the influence
score of the nodes relative to the network.
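
The influence matrix and the per-node influence scores can be sketched in a few lines. This is a minimal illustration, assuming the closed form $P = \beta A (I - \alpha A)^{-1}$ and the row-sum score $p_i = \sum_j P_{ij}$; the function names, the three-node path graph, and the values $\alpha = 0.1$, $\beta = 1$ are hypothetical choices for the example. The convergence check relies on the standard fact that the underlying series $\beta A + \beta\alpha A^2 + \ldots$ converges only when $\alpha$ times the spectral radius of A is below 1.

```python
import numpy as np

def influence_matrix(A, alpha, beta=1.0):
    """Closed form P = beta * A (I - alpha A)^{-1}."""
    A = np.asarray(A, dtype=float)
    # The path series converges only if alpha * spectral_radius(A) < 1.
    spectral_radius = np.max(np.abs(np.linalg.eigvals(A)))
    if alpha * spectral_radius >= 1.0:
        raise ValueError("alpha too large: the path series diverges")
    identity = np.eye(A.shape[0])
    return beta * A @ np.linalg.inv(identity - alpha * A)

def influence_scores(P):
    """Influence of each node relative to the network: p_i = sum_j P_ij."""
    return P.sum(axis=1)

# Hypothetical path graph 0 - 1 - 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

P = influence_matrix(A, alpha=0.1)
p = influence_scores(P)

# Cross-check against the truncated series A + alpha A^2 + alpha^2 A^3 + ...
series = sum(0.1 ** k * np.linalg.matrix_power(A, k + 1) for k in range(60))
```

In this toy graph the middle node receives the highest score, since it lies on more paths of every length than either end node.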

Recently, researchers have applied probabilistic models,
such as mixture models, to the community discovery task.
The advantage of these models is that they can probabilistically
assign a node to more than one community because, as has been
observed, "objects can exhibit several distinct identities in
their relational patterns" [1][8]. This may indeed be true, but
whether the nodes in the network are to be divided into distinct
communities, or the probabilities with which each node belongs
to each community are to be discovered, really depends on the
specific application.

6. CONCLUSION AND FUTURE WORK 

We have proposed a new definition of a community in 
terms of the influence that nodes have on each other. We 
gave a mathematical formulation of influence in terms of 
the number of paths of any length that link two nodes, and 
redefined modularity in terms of the influence metric. We 
used the new definition of modularity to partition a network
into communities. We applied this framework to networks
well studied in the literature and found that it produces results
at least as good as the edge-based modularity approach.

Although the formulation developed in this paper applies 
equally well to directed graphs, we have only implemented 
the algorithm on undirected ones. Hence, future work includes
implementation of the algorithm on directed graphs, which are
common on social networking sites, as well as applying it to
bigger networks.

Leskovec et al. [10] state that they "observe tight but almost
trivial communities at very small scales, the best possible
communities gradually 'blend in' with rest of the network
and thus become less 'community-like'." However, the
hypothesis that they employ to detect communities is that 
communities have "more and/or better-connected 'internal 
edges' connecting members of the set than 'cut edges' con- 
necting to the rest of the world." Hence, like most graph 
partitioning and modularity based approaches to commu- 
nity detection, their process depends on the local property 
of connectivity of nodes to neighbors via edges and is not 
dependent on the structure of the network on the whole. 
Besides, it also does not take into account the heterogeneity 
of node types, that is 'who' are the nodes that a node is 
connected to and how influential these nodes are. There- 
fore, we argue that a global property, such as the measure 
of influence, is a better approach to community detection. 
It remains to be seen whether communities will similarly 
'blend in' with the larger network if one uses the influence 
metric to discriminate them. 
APPENDIX



Below we summarize the application of the leading eigenvector
method of Newman [13] to influence-based modularity.
If we consider the division of the network into two communities,
then we can write Q as:

$$Q = \frac{1}{2} \sum_{ij} \left[ P_{ij} - \frac{W_i^{out} W_j^{in}}{W} \right] (s_i s_j + 1) \qquad (25)$$

where

$$
s_i =
\begin{cases}
+1 & \text{if vertex } i \in \text{group 1;} \\
-1 & \text{if vertex } i \in \text{group 2,}
\end{cases}
$$

s is a vector whose elements are $s_i$, and matrix C consists
of elements $C_{ij}$ such that $C_{ij} = P_{ij} - W_i^{out} W_j^{in} / W$. We
symmetrize matrix C to get matrix $B = C + C^T$. B is now
called the modularity matrix, and we approximate modularity
as

$$Q = \frac{1}{4} s^T B s \qquad (26)$$

Hence, if we want to divide the network in such a way that
there is more than expected influence within the communities,
we have to maximize the change in modularity due to
subdivision. We note that before the initial division, i.e.,
taking the entire network, all the elements belong to the same
community or group, so the modularity is
$Q = \frac{1}{4} \sum_{ij} B_{ij}$. Therefore, the additional contribution
$\Delta Q$ to modularity upon dividing subgroup g is:

$$
\begin{aligned}
\Delta Q &= \frac{1}{4} \Big[ \sum_{i,j \in g} B_{ij} s_i s_j - \sum_{i,j \in g} B_{ij} \Big] && (27) \\
&= \frac{1}{4} \sum_{i,j \in g} \Big[ B_{ij} - \delta_{ij} \sum_{k \in g} B_{ik} \Big] s_i s_j && (28) \\
&= \frac{1}{4} \sum_{i,j \in g} B^{(g)}_{ij} s_i s_j && (29) \\
&= \frac{1}{4} s^T B^{(g)} s && (30)
\end{aligned}
$$

where $B^{(g)}_{ij} = B_{ij} - \delta_{ij} \sum_{k \in g} B_{ik}$ and g is the entire network
for the first division of the directed graph into two communities
$C_1$ and $C_2$. We can iteratively subdivide the resulting
communities $C_1$ and $C_2$. $\Delta Q$ reflects the additional contribution
to modularity of the entire network as the result of
these subdivisions. If no further division increases modularity,
we stop the process. The communities thus found are
the optimal, or natural, communities within the network.

Next we show that maximizing the modularity can be
approximated using eigenvalue decomposition. We can write
s as a linear combination of the normalized eigenvectors $u_i$
of $B^{(g)}$. Hence

$$s = \sum_i a_i u_i \qquad (31)$$

Hence $a_i = u_i^T s$; therefore

$$
\begin{aligned}
\Delta Q &= \frac{1}{4} s^T B^{(g)} s && (32) \\
&= \frac{1}{4} \sum_i a_i u_i^T B^{(g)} \sum_j a_j u_j && (33) \\
&= \frac{1}{4} \sum_{ij} a_i a_j u_i^T B^{(g)} u_j && (34) \\
&= \frac{1}{4} \sum_{ij} a_i a_j \lambda_j u_i^T u_j && (35) \\
&= \frac{1}{4} \sum_i a_i^2 \lambda_i && (36)
\end{aligned}
$$

where $\lambda_i$ is the eigenvalue of $B^{(g)}$ corresponding to eigenvector
$u_i$. The eigenvalues (and their corresponding eigenvectors)
are labeled in decreasing order, i.e.,
$\lambda_1 > \lambda_2 > \lambda_3 > \lambda_4 > \cdots$

Since we wish to maximize $\Delta Q$, we would like to choose s
such that the maximum weight is concentrated on the largest
eigenvalue. The optimal solution would then be to choose s
proportional to $u_1$. However, the elements of s are constrained
to be either 1 or -1. The approximation then used is similar
to the one used in spectral partitioning: all nodes whose
corresponding elements in $u_1$ are positive are put in one group
and the rest in the other group.
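
The bisection step summarized in this appendix can be sketched as follows. This is an illustrative sketch rather than the implementation used in the paper: it assumes $W_i^{out}$ and $W_j^{in}$ are the row and column sums of the influence matrix P and W is its total, and the function name `bisect_by_influence`, the six-node test graph (two triangles joined by one edge), and $\alpha = 0.1$, $\beta = 1$ are hypothetical choices. It handles only the first division, where g is the entire network.

```python
import numpy as np

def bisect_by_influence(P):
    """Split a network in two using the leading eigenvector of B = C + C^T.

    Assumption for this sketch: W_i^out and W_j^in are the row and
    column sums of P, and W is the sum of all entries of P.
    """
    P = np.asarray(P, dtype=float)
    w_out = P.sum(axis=1)
    w_in = P.sum(axis=0)
    W = P.sum()
    C = P - np.outer(w_out, w_in) / W   # C_ij = P_ij - W_i^out W_j^in / W
    B = C + C.T                         # symmetrized modularity matrix
    # With these definitions the rows of B sum to zero, so for the
    # first division (g = entire network) B^(g) equals B.
    eigenvalues, eigenvectors = np.linalg.eigh(B)  # ascending order
    u1 = eigenvectors[:, -1]            # eigenvector of the largest eigenvalue
    s = np.where(u1 > 0, 1, -1)         # sign of each element picks the group
    delta_q = 0.25 * (s @ B @ s)
    return s, delta_q

# Hypothetical test graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

alpha = 0.1  # must stay below 1 / (largest eigenvalue of A)
P = A @ np.linalg.inv(np.eye(6) - alpha * A)

s, delta_q = bisect_by_influence(P)
```

On this toy graph the sign pattern of the leading eigenvector separates the two triangles. Repeating the step on each resulting group with $B^{(g)}$ in place of B, and stopping when $\Delta Q$ is no longer positive, would give the iterative subdivision described above.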



